SLAPP: A special purpose multiprocessor array for signal processing and linear algebra

I.  I  Symanski.  loin  Heiulerson.  Judy  Slurasago.  John  Cello.  Harrs  Drake 

Naval Ocean Systems Center
San  Diego.  CA  92152 


ABSTRACT 

The key to meeting the extremely high throughput requirements of future military signal and image processing systems is parallelism
in algorithms and hardware. This paper will describe the implementation of a core set of algorithms on one possible hardware
implementation, designed to achieve high speed and efficient parallelism. This approach and design procedure, while using currently
available integrated circuit building blocks, is similar to how this type of processor will be developed in the future using VLSI.

1. INTRODUCTION


It is now almost a decade since Kung and Leiserson¹ first described a systolic array. Many advances have been made and several
machines are in use. Algorithms are under development which will take advantage of the parallelism available in systolic and other
architectures. This paper will describe work aimed specifically at efficient implementation of a core set of algorithms for use in linear
algebra and signal processing. The architecture of the SLAPP has been described¹ previously. This paper will describe data movement in
the algorithms of interest, namely, matrix multiplication, QRD, SVD and generalized SVD, as well as some details of a bit-slice
microprogrammed implementation of the SLAPP.


2.  MATRIX  MULTIPLICATION 


Matrix multiplication is the easiest algorithm to implement due to its regularity and use of only multiplies and adds. It requires n^3
operations which can be done in O(n) time on n^2 processors. There are several ways to implement the algorithm which depend on whether
the data is outside the array or has been loaded into the array. For data coming from outside the array the engagement process of Speiser
and Whitehouse requires 3n-2 steps. For data within the array, the process described by Symanski requires 2n-1 steps. The SLAPP array
could implement the engagement algorithm more easily by not requiring skewing of the data by the host, which complicates the host IO
and programming. The data could simply be dumped into the top row and left column of the SLAPP array and stored in auxiliary or dual
port RAM. Appropriate routines and control message passing could distribute the data down the columns and across the rows to complete
the algorithm.
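The 3n-2 step count for the engagement process can be checked with a small timing model. The following is a minimal sketch in C, not the SLAPP microcode: it assumes an n x n array in which processor (i, j) meets the operand pair a[i][k], b[k][j] at time step i + j + k, which is one common way to model skewed operands. The function names and the row-major flat array layout are hypothetical.

```c
#include <assert.h>

/*
 * Timing model of a skewed systolic product C = A * B on an n x n array.
 * Assumption: processor (i, j) sees a[i][k] and b[k][j] at step t = i + j + k.
 * Matrices are stored row-major in flat arrays. Returns the step count.
 */
int systolic_matmul(int n, const double *a, const double *b, double *c)
{
    int steps = 0;

    assert(n > 0 && a && b && c);
    for (int i = 0; i < n * n; i++)
        c[i] = 0.0;

    /* The last operand pair meets at t = 3 * (n - 1). */
    for (int t = 0; t <= 3 * (n - 1); t++) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                int k = t - i - j;          /* operand index arriving now */
                if (k >= 0 && k < n)
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
        steps++;
    }
    return steps;                           /* equals 3 * n - 2 */
}

/* Usage example: [1 2; 3 4] * [5 6; 7 8] = [19 22; 43 50] in 4 steps. */
void systolic_matmul_example(void)
{
    double a[4] = {1, 2, 3, 4};
    double b[4] = {5, 6, 7, 8};
    double c[4];

    assert(systolic_matmul(2, a, b, c) == 3 * 2 - 2);
    assert(c[0] == 19 && c[1] == 22 && c[2] == 43 && c[3] == 50);
}
```

Each processor performs at most one multiply-add per step, so the loop over t stands in for the global clock of the array.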


3. QR DECOMPOSITION (QRD)


The QRD algorithm of F. T. Luk utilizes two types of processors: boundary processors, which lie on the main diagonal, and internal
processors. Sine/cosine pairs for each row are generated in the boundary processors and then passed down the row to be applied to the
internal processors. Each processor operates on a 2x2 sub-matrix. Data is input to the systolic array through the top processors. The data
is time skewed, so that each pair of columns is skewed one row. This skewing ensures that the correct sine/cosine pair operates on the
appropriate two rows of the data matrix.

A C program was written to simulate the data movement of the above QRD algorithm using the unified architecture. For an M x 8
data matrix with the unified architecture, there are two virtual processors per physical processor. Each virtual processor contains seven
elements: the 2x2 sub-matrix (UL, UR, LL, LR), sine, cosine, and temp. Temp is a memory buffer that exists for correct data movement.
The lower right (LR) element of the 2x2 sub-matrix must be held in each processor one additional time step so that the correct data gets
moved to the next processor at the appropriate time. Only the data in the LL and LR elements move. The UL and UR elements are
updated until they converge to R, the upper triangular result. The order of movement for the LL and LR elements is as follows:

LL  -  LR 
TEMP  -  LL 
LR  -  TEMP 

Since  there  are  two  virtual  processors  per  physical  processor,  there  is  data  movement  within  as  well  as  between  physical  processors. 
After  the  sine/cosine  pairs  are  generated  and  applied  to  the  boundary  processors,  they  are  passed  one  processor  to  the  right  every  time  step 
to be applied to the internal processors. See Figures 1 and 2.
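The seven elements held by each virtual processor map naturally onto a C structure. The sketch below is an illustration only, not the authors' simulation code: the type and function names are hypothetical, and the rotation is assumed to act on the two rows of the sub-matrix in the form [c s; -s c].

```c
#include <assert.h>

/* One virtual processor of the QRD simulation: seven stored elements. */
typedef struct {
    double ul, ur;        /* upper row of the 2x2 sub-matrix (converges to R) */
    double ll, lr;        /* lower row; only these elements move */
    double sine, cosine;  /* rotation received from the boundary processor */
    double temp;          /* buffer used for correct data movement */
} vproc_t;

/* Apply the stored rotation [c s; -s c] to the rows of the sub-matrix. */
void vproc_rotate(vproc_t *p)
{
    assert(p != 0);
    double c = p->cosine, s = p->sine;
    double ul = p->ul, ur = p->ur, ll = p->ll, lr = p->lr;

    p->ul =  c * ul + s * ll;
    p->ur =  c * ur + s * lr;
    p->ll = -s * ul + c * ll;
    p->lr = -s * ur + c * lr;
}

/* Pass the sine/cosine pair one processor to the right. */
void vproc_pass_rotation(const vproc_t *from, vproc_t *to)
{
    assert(from != 0 && to != 0);
    to->sine = from->sine;
    to->cosine = from->cosine;
}

/* Usage example: sine = 0.8, cosine = 0.6 zeroes LL of [3 1; 4 2]. */
void vproc_example(void)
{
    vproc_t p = {3.0, 1.0, 4.0, 2.0, 0.8, 0.6, 0.0};

    vproc_rotate(&p);
    assert(p.ul > 4.999999 && p.ul < 5.000001);
    assert(p.ll > -1e-9 && p.ll < 1e-9);
}
```

Keeping the pass step separate from the rotate step mirrors the text, where a pair is applied and then forwarded one processor to the right on each time step.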
A C simulation is being written for the data movement of the SVD algorithm using the unified architecture. For an M x 8 upper
triangular matrix, there are three virtual processors (two even and one odd) in the physical processors on the main diagonal and two
virtual processors [one even and one odd] in the internal physical processors. Each virtual processor contains eight elements: the 2x2
sub-matrix [UL, UR, LL, LR] and the two sine/cosine pairs [cos_j, sin_j, cos_k, sin_k]. Data movement is between as well as within
physical processors. The cos_j/sin_j pairs are passed horizontally one virtual processor to the right and the cos_k/sin_k pairs are
passed vertically one virtual processor upward. See Figures 3 and 4.


Figure  4.  SVD  Data  Flow  in  the  SLAPP  Array 

Figure 3. SVD Data Flow in the SLAPP Processor (panels: odd processor, even processor)


As mentioned above, there are two sets of sine/cosine rotations generated. The equations used in the simulation for the sine/cosine
generation are:


Luk's Equations:

srho = (w + z) / x;
sin_s = sign(srho) / sqrt(1 + srho^2);
cos_s = sin_s * srho;

den = 2 * sin_s * w;
num = cos_s * (z - w) + sin_s * x;
krho = num / den;

t = -sign(krho) * (abs(krho) + sqrt(1 + krho^2));
cos_k = 1 / sqrt(1 + t^2);
sin_k = cos_k * t;

Overflow Equations:

srho = x / (w + z);    (if |x| < |w + z|)

cos_s = 1 / sqrt(1 + srho^2);
sin_s = cos_s * srho;

den = 2 * sin_s * w;    (if |den| < |num|)

num = cos_s * (z - w) + sin_s * x;
krho = den / num;

t = (-sign(krho) * abs(krho)) / (1 + sqrt(1 + krho^2));
sin_k = -sign(krho) / sqrt(1 + t^2);
cos_k = sin_k * t;

cos_j = cos_s * cos_k - sin_s * sin_k;
sin_j = cos_s * sin_k + sin_s * cos_k;


K  = [  cos_k   sin_k ]
     [ -sin_k   cos_k ]

S  = [  cos_s   sin_s ]
     [ -sin_s   cos_s ]

J' = [  cos_j  -sin_j ]
     [  sin_j   cos_j ]

A  = [  w   x ]
     [  0   z ]


Diagonalization of a 2x2 matrix is done in two steps: symmetrization and annihilation (D = J' * A * K). K is the annihilation
rotation matrix and S is the symmetrization rotation matrix. In the simulation, the sines and cosines for J are computed from the sines
and cosines of S and K. This saves one matrix-matrix multiply.

For matrices with a column dimension higher than the number of columns in the systolic array, the physical processors will have more
virtual processors in them, as in the QR. For an M x 16 upper triangular matrix, the physical processors on the main diagonal would now
have eight virtual processors [five even and three odd] in them while the internal physical processors would contain four each of the odd
and even processors.

GENERALIZED SVD (GSVD)

Two preparatory QRD's are required for the computation of the GSVD. One matrix is fed into each of the arrays: A into array Ta and
B into array Tb. After the QRD of each matrix is computed, an implicit SVD of R_A R_B^-1 begins. For simplicity we consider only one odd
rotation of a typical sweep. Also, R_a, R_b are 2 x 2 submatrices of R_A, R_B, respectively.

After the elements in the boundary processors of, say, array Tb have been communicated to the boundary processors of array Ta, a
2x2 matrix C is formed in each of the overlapped boundary processors. The elements of C in a typical boundary processor are given by

c_ij = Σ_k (R_a)_ik adj(R_b)_kj

where the adjugate matrix adj(X) = Det(X) X^-1 is computed to avoid computation of the determinant. C is upper triangular since the
submatrices R_a and R_b are upper triangular in the boundary processors. Next, rotation sines and cosines are computed to zero out the
off-diagonal elements of C in each boundary processor as described in [7,9]. The equation

[ d_1   0  ]
[  0   d_2 ]  =  J' C K
              =  J' R_a R_b^-1 K
              =  J' R_a (K' R_b)^-1

shows where to send these rotation sines and cosines within the arrays Ta and Tb. The c_J, s_J pair propagates to the east through array
Ta where it is applied to the elements of R_A. Likewise, the c_K, s_K pair is sent east in array Tb where it is applied to the elements of R_B.
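Forming C from the adjugate needs only multiplies and adds, which a short C sketch makes concrete. This is an illustration under the assumption that both 2x2 submatrices are upper triangular and stored as three values each. The type and function names are hypothetical.

```c
#include <assert.h>

/* Upper triangular 2x2 matrix [a b; 0 d]. */
typedef struct { double a, b, d; } utri2_t;

/* adj([a b; 0 d]) = [d -b; 0 a], equal to det(X) * inverse(X), with no divide. */
utri2_t utri2_adj(utri2_t x)
{
    utri2_t r = { x.d, -x.b, x.a };
    return r;
}

/* Product of two upper triangular 2x2 matrices (again upper triangular). */
utri2_t utri2_mul(utri2_t p, utri2_t q)
{
    utri2_t r = { p.a * q.a, p.a * q.b + p.b * q.d, p.d * q.d };
    return r;
}

/* C = R_a * adj(R_b), as formed in an overlapped boundary processor. */
utri2_t form_c(utri2_t ra, utri2_t rb)
{
    return utri2_mul(ra, utri2_adj(rb));
}

/* Usage example: R_a = [2 1; 0 3], R_b = [4 5; 0 6] gives C = [12 -6; 0 12]. */
void form_c_example(void)
{
    utri2_t ra = {2, 1, 3}, rb = {4, 5, 6};
    utri2_t c = form_c(ra, rb);

    assert(c.a == 12 && c.b == -6 && c.d == 12);
}
```

C differs from R_a R_b^-1 only by the scalar factor Det(R_b), so the rotations that diagonalize one also diagonalize the other.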

Since the previous rotations destroy the upper triangular structure of J' R_a and K' R_b, a rotation matrix Q is computed to return both
to upper triangular form. This is possible because the rows of R_a and R_b are parallel. Thus,

(J' R_a) Q,  (K' R_b) Q

are upper triangular, where Q can be computed in the overlapped boundary processors either from R_a or R_b. The c_Q, s_Q pair is sent
north through both Ta and Tb where it is applied to R_A and R_B, respectively. The algorithm continues in this way through the odd-even
iterations until R_A and R_B have parallel rows, i.e.,

U' (R_A) (R_B)^-1 V = D
U' R_A = D V' R_B

where D is diagonal and, hence, the rows of the LHS are just a scalar multiple of the rows of the RHS.

SLAPP  UNIFIED  ARCHITECTURE 

The SLAPP bitriangular architecture has been described previously¹. Now we will describe the unification of the matrix
multiplication, QRD, SVD and GSVD onto the SLAPP architecture.


Figure 2. QRD Data Flow in the SLAPP Array

Figure 1. QRD Data Flow in the SLAPP Processor (signal label: SIN/COS)

The equations used in the simulation for the sine/cosine generation are:

Luk's Equations:

rho = w / y;
sin = sign(rho) / sqrt(1 + rho^2);
cos = rho * sin;

Overflow Equations: (if |y| < eps * |w|)

rho = y / w;
cos = 1 / sqrt(1 + rho^2);
sin = rho * cos;

where w, x, y, and z are the elements of the 2x2 sub-matrix.


For matrices with a column dimension higher than the number of columns in the systolic array, each physical processor would contain
more virtual processors. A simulation was also written for the data movement of an M x 16 data matrix using the unified architecture.
There are now eight virtual processors per physical processor instead of two. For an M x 32 data matrix, there would be 32 virtual
processors per physical processor. One major difference in these simulations is that there are now internal as well as boundary virtual
processors in the physical boundary processors.

4. SINGULAR VALUE DECOMPOSITION (SVD)

The SVD algorithm of F. T. Luk is an algorithm for an n x n upper triangular matrix R. It is based on the odd-even ordering of
Stewart. Like the QR algorithm above, sine/cosine pairs are generated in the boundary processors. These rotations are then applied to the
rows and columns associated with that processor. Again, each processor operates on a 2x2 submatrix.

With the odd-even ordering, there are two kinds of processors: odd processors are those with the upper left element of the 2x2
submatrix being odd and even processors are those with an even upper left element. The algorithm begins with an odd rotation followed by
an even rotation. This pattern continues until convergence. A rotation consists of the following: sine/cosine pairs are computed in the odd
(even) boundary processors and applied. Data in the boundary processors is sent to even (odd) processors to the northwest, northeast, and
southeast and the sine/cosine pairs are passed to neighboring odd (even) processors to the north and east. After being applied to these
internal processors, data is sent to even (odd) processors to the northwest, northeast, southwest, and southeast and the sine/cosine pairs
are passed to the north and east. The even (odd) rotation can now begin since the boundary processors have the necessary data to start
computing their sine/cosine pairs. Note that the sine/cosine pairs in the odd (even) processors are still being propagated up and to the
right to the rest of the odd (even) internal processors. This occurs since a processor can begin working as soon as it receives all data and
rotations from its neighboring processors.
... Successive columns of A and rows of B are delayed one time step at successive processors. Partial products are accumulated in
each processor as the matrices flow through the array. At completion the result matrix is stored in the systolic array. For many
matrix-matrix multi... (Figures 5a, 5b) ...

This results in some ... to solve the ... problem. Also, due to the odd-even ordering for the SVD algorithm, half of the processors
are idle during odd (even) cycles. The unified arch...

Figure 5. The Unified Architecture (panels: (b) SVD, (c) UNIFIED)

... the "QRD and SVD processors" are ... the unified architecture ... The number of processors required to factor an m x 8 matrix is
reduced from 23 to 10.

... two triangular arrays together and connecting corresponding processors, giving a three dimensional architecture as depicted in Fig. 6.

Figure 6. The Bitriangular Unified Architecture
SLAPP NODE ARCHITECTURE

The architecture of the SLAPP node is shown in Figure 7. The components contained in each block will be explained below.

The Input/Output Processor Control (IOPC) consists of a sixteen bit sequencer (IDT49C410) and test circuitry for determining the state
of the IOP and the conditions for microprogram jumps. Also in this block are the control microcode and other control circuitry.

The five IO Ports, with appropriate buffer and control circuitry, are contained in the IO Ports block. This circuitry is mostly SSI buffers
and control circuitry for fast block IO transfer control. This accounts for a large number of packages in an SSI design. For highest speed a
full 32 bit parallel approach is best but requires many pins. Pin count could be brought down significantly by four bit wide data paths or
bit serial IO, but the IO transfer rate suffers. The IO transfer rate may not be the limiting factor in some applications, so the narrowing of
the IO path width would be a good trade-off parameter.

The FIFO block contains two FIFOs. The first carries commands from the IOPC to the Linear Algebra Processor Controller (LAPC).
The IOPC will tell the LAPC that data has arrived for a particular operation and the LAPC will execute the appropriate
code. When the LAPC has finished the computations, it sends a status word to the IOPC via the second FIFO to inform the IOPC that
data can be moved out or a new operation can be started.

The LAPC is similar to the IOPC in that it is a 16 bit microsequencer with testing capability to determine what needs to be done and
control circuitry to perform the necessary actions.

The Dual Port RAM acts as the data storage and serves to decouple the two processors from one another and simplify programming
and data fetching. The dual port RAM used in the simulation is the IDT7132/42 16K (2Kx8). There are two 2Kx32 bit sections of dual port
RAM, each connected to a separate bus. This is to allow IO with two of the five ports simultaneously as well as to supply two operands
to the arithmetic processor in one cycle.

The Linear Algebra Processor (LAP) is the computation unit for the SLAPP node. It consists of the Bipolar Integrated Technology
(BIT) B2110/2120 multiplier and ALU, plus a small register file for temporary storage during computations. The multiplier and
accumulator chips together contain about 125,000 devices.

The Auxiliary RAM is made up of high density static RAM for storage of large amounts of data which may come into the node during
computations and to allow distributed memory to minimize array IO. Control circuitry for DMA transfers could take 20 ICs or a PLD.

A rough count of the gates and device equivalents is shown in Table I. The device estimates use four devices per gate and one device
per RAM memory bit. Note that the multiplier and accumulator chips account for almost three quarters of the gate count. But when
memory is included, the microcode and RAM require about eight times as many devices as the logic. The relatively high density of memory
compared with the lower density of gates and wiring may make this comparison less important, since silicon area is the important issue
rather than device count.


Table I. The SLAPP Processor Gate and Device Count

BLOCK             ICs (VLSI)    GATES             RAM
IOPC              36 (6)        2,800             384 Kbits
FIFO              10            -                 1 Kbits
LAPC              30 (10)       2,600             640 Kbits
IO PORTS          ?             4,400             -
DUAL PORT RAM     30 (8)        1,000             128 Kbits
AUXILIARY RAM     1? (4)        500               256 Kbits
LAP               16 (2)        31,000            -
HOST INTERFACE    10 (2)        500               128 Kbits

TOTAL             ? (32)        42,800 gates      1,537,000 bits
                                171,200 devices   1,537,000 devices

Total devices: 1,700,000

Calculations use 4 transistors per gate and 1 transistor per RAM bit.

Also, the divide and square root circuitry within the multiplier chip requires only about 14,000 devices out of the 125,000 devices in the
floating point chip set, about 12% of the total logic device count and less than 1% of the total device count of the SLAPP node. This
being the case, the use of less capable internal processors, which only multiply and add, does not seem warranted. This is especially true if
one wishes to reconfigure the array in some manner due to application requirements or node failure. Having all nodes be boundary
processors simplifies system design and yields far better system flexibility and fault tolerance.
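The device totals follow from the two stated conversion factors, and the arithmetic can be replayed in a few lines of C. This is a sketch for checking the sums only. It assumes 1 Kbit means 1000 bits and uses the RAM column as read from the table above, where the 256 Kbit and 128 Kbit entries for the auxiliary RAM and host interface are assumed readings of damaged text. The function names are hypothetical.

```c
#include <assert.h>

#define DEVICES_PER_GATE    4   /* stated with the table */
#define DEVICES_PER_RAM_BIT 1   /* stated with the table */

/* Sum of the RAM column in Kbits (assumed readings, see the lead-in). */
long ram_kbits_total(void)
{
    const long kbits[] = { 384, 1, 640, 128, 256, 128 };
    long sum = 0;

    for (unsigned i = 0; i < sizeof kbits / sizeof kbits[0]; i++)
        sum += kbits[i];
    return sum;
}

/* Device equivalents for a given gate count and RAM size in bits. */
long total_devices(long gates, long ram_bits)
{
    assert(gates >= 0 && ram_bits >= 0);
    return gates * DEVICES_PER_GATE + ram_bits * DEVICES_PER_RAM_BIT;
}

/* Usage example: 42,800 gates and 1,537 Kbits give about 1.7 million devices. */
void device_count_example(void)
{
    long ram_bits = ram_kbits_total() * 1000;

    assert(ram_bits == 1537000);
    assert(42800L * DEVICES_PER_GATE == 171200);
    assert(total_devices(42800, ram_bits) == 1708200);
    /* 14,000 divide and square root devices are under 1% of the node total */
    assert(14000L * 100 < total_devices(42800, ram_bits));
}
```

The exact sum, 1,708,200, rounds to the 1,700,000 quoted as the total device count.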
6. IMPLEMENTATION

Due to the high speed requirements, complexity and longevity of military systems, the NOSC work has emphasized a high speed,
programmable processor. When working at the early stages of algorithm development, there is really no way to specify exact speed
requirements. Real systems are far too complex and involve too many operations to give exact estimates as to the speed requirements of a
particular algorithm. In general, there is never enough speed or memory. If one is dealing with an application which is very well
understood and the hardware and software will never be changed, it may be possible. This is rarely the case in complex military systems.

The  physical  size  of  the  system  can  be  critical  in  some  applications.  Fabricating  an  array  with  commercially  available  devices,  the 
SLAPP  processor  would  require  about  200  integrated  circuit  packages.  The  same  design  could  be  reduced  to  about  40  packages  using  gate 
arrays  with  five  to  ten  thousand  gates  each.  The  current  processor  design  utilizes  32  LSI  devices.  The  gate  arrays  would  replace  only  the 
simpler  components. 

The processor could be implemented in one package if full custom or a megacell technology were used. Rough estimates of the gate
and memory requirements indicate that the processor would require less than a quarter of a square inch of silicon using 0.5 micron CMOS
technology proposed for VHSIC Phase 2 in the 1990s.

Design time grows as the size shrinks. Even with the availability of powerful CAE tools, the complexity of the systems tends to
increase errors and prolong completion. If the correct tools are not available at the beginning of the development, the acquisition of the
tools and training in the efficient use of these complex tools takes significant time.

Programming of sophisticated routines can be difficult, especially at the microcode level. For efficient implementation, it usually
must be done for each macro routine by a programmer experienced in the hardware resources of the machine. A powerful compiler could
be written to generate microcode using a high level language, but the writing of an efficient compiler is no simple task in itself. In
designing a processor to efficiently implement routines it is highly desirable to actually test the routines on a simulation of the hardware to
avoid costly changes and iterations later in the hardware. The addition of a register file or another data path or more memory can be costly
once the fabrication process has begun. Program verification can be done with the appropriate CAE tools. Of course, if a large user
community is to be supported, a high level language is essential. The Warp and iWarp work at Carnegie Mellon is an excellent example of
this kind of large scale development.

Simple tools for writing microcode for the SLAPP node were written in C in about a man-month. There are four programs: one for
generating microcode for the IOP and a second for generating code for the LAP. These two programs take text files written with an
ordinary ASCII editor and convert the text to binary files used by the simulation program. A third program converts floating point numbers
into binary data files that the logic simulation program can use for input, and a fourth program converts the binary data files generated by
the simulation to IEEE floating point numbers the user can read easily. These tools make generating and interpreting simulation data
much easier.
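The core of the third and fourth tools is a conversion between a floating point value and its bit pattern. A minimal C sketch of that conversion is shown below. It assumes 32 bit IEEE 754 single precision on the host, which matches the 32 bit data paths of the node but is not stated for the tools themselves. The function names are hypothetical and the file formats used by the actual tools are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the IEEE 754 single precision bit pattern of f. */
uint32_t float_to_bits(float f)
{
    uint32_t u;

    assert(sizeof u == sizeof f);
    memcpy(&u, &f, sizeof u);   /* copy bytes; avoids pointer aliasing issues */
    return u;
}

/* Rebuild a float from its 32 bit pattern. */
float bits_to_float(uint32_t u)
{
    float f;

    memcpy(&f, &u, sizeof f);
    return f;
}

/* Usage example with values whose bit patterns are exact. */
void float_bits_example(void)
{
    assert(float_to_bits(1.0f) == 0x3F800000u);
    assert(float_to_bits(-2.0f) == 0xC0000000u);
    assert(float_to_bits(0.15625f) == 0x3E200000u);
    assert(bits_to_float(float_to_bits(0.15625f)) == 0.15625f);
}
```

Writing the 32 bit words to a file in a fixed byte order would then give data that a logic simulator can load, and reading them back through bits_to_float gives values a user can print.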


7.  SUMMARY 


A core set of algorithms useful for many applications in signal processing has been developed and mapped onto a programmable high
speed parallel processing architecture. Simulations of the algorithms have been completed on the IBM PC in PC Matlab and in the C
language. Unique features of the SLAPP array are 1) independent IO and computation sequencers in each processor to achieve highest
system efficiency and ease programming, and 2) a high speed, non-pipelined arithmetic unit capable of performing divides and square roots
in hardware. A gate level simulation of a dual microsequencer implementation of the architecture has been entered into a CAE workstation
and logic simulation of processor operation has started.